Abstract

VISUAL LANGUAGE CLASSIFICATION SYSTEM 



Disclosed is a method and system (500) for automated classification of a digital image 
(502). The method analyses the image for the presence of a human face. A determination 
is then made regarding the size of the located face compared to the size of the 
image (Figs. 1A-1G) to classify the image based on the relative size of the face.
Alternatively, the position of the face within the image can be used to determine the 
classification. With a classified image, particularly forming part of a sequence of 
classified images, editing (514) of the sequence may be performed dependent upon the 
classification to achieve a desired aesthetic effect. The editing may be performed with the 
aid of an editing template (706).

VISUAL LANGUAGE CLASSIFICATION SYSTEM
Technical Field of the Invention 

The present invention relates generally to the classification of image data and, in 
particular, to a form of automated classification that permits an editor to automatically 
generate emotive presentations of the image data. 

Background 

The editing of sequences of images (eg. films, video, slide shows) to
achieve a desired reaction from an audience traditionally requires input from a human 
editor who employs techniques other than the mere sequencing of images over a time line. 
To achieve an understanding by the audience of the intended message or purpose of the 
production, the editor must draw upon human interpretation methods which are then 
applied to moving or still images that form the sequence. 

Film makers use many techniques to obtain a desired meaning from images, such 
techniques including the identification and application of different shot types, both 
moving and still, the use of different camera angles, different lens types and also film 
effects. The process of obtaining meaning from the images that make up the final 
production commences with a story or message that is then translated into a storyboard 
that is used by the film crew and film director as a template. Once the film is captured, 
the editor is then given the resulting images and a shot list for sequencing. It is at an early 
stage of production, when the screen writer translates the written story or script to a 
storyboard, that written language becomes visual language. This occurs due to the 
method by which the audience is told the story and must interpret the message. The 
visual nature of a moving image generally only has dialogue relevant to the character's 
experience and, in most cases, is absent of explicit narrative relative to the story being 
told and the emotional state of the characters within the story. The screen writers must 
therefore generate this additional information using the visual language obtained from 
different shot types. 

Examples of different shot types or images are seen in Figs. 1A to 1G. Fig. 1A 
is representative of an extreme long shot (ELS) which is useful for establishing the 
characters in their environment, and also orientating the audience as to the particular 
location. Fig. 1B is representative of a long shot (LS) which is also useful for
establishing the characters in their environment and orientating the audience as to the 
location. In some instances, an ELS is considered more dramatic than the LS. Fig. 1C is 
representative of a medium long shot (MLS) in which the characters are closer to the 
viewer and indicates, in a transition from a long shot, subjects of importance to the story.
Typically for human subjects, an MLS views those subjects from the knees upwards.
Fig. 1D is indicative of a medium shot (MS) in which human characters are generally
shown from the waist upwards, and the shot assists the viewer in interpreting the characters'
reactions to their environment and any particular dialogue taking place. Fig. 1E is
indicative of a medium closeup (MCU) in which human characters are generally shown
from the chest upwards. The MCU is useful for dialogue and communication
interpretation including the emotion of the speaking characters. Fig. 1F is indicative of a
closeup (CU) which for human characters frames the forehead and shoulders within the 
shot, and is useful for clear understanding of the emotions associated with any particular 
dialogue. The closeup is used to consciously place the audience in the position of the 
character being imaged to achieve a greater dramatic effect. Fig. 1G is representative of 
an extreme closeup (ECU) formed by a very tight shot of a portion of the face and 
demonstrates beyond the dialogue the full dramatic effect of intended emotion. An ECU 
can be jarring or threatening to the audience in some cases and is often used in many 
thriller or horror movies. It will further be apparent from the sequence of images in 
Figs. 1A to 1G that different shots clearly can display different meaning. For example, 
neither of Figs. 1F and 1G indicates that the subject is seen flying a kite, nor do Figs. 1D
or 1E place the kite flying subject on a farm indicated by the cow seen in Figs. 1A to 1C.
Further, it is not apparent from Fig. 1A that the subject is smiling or indeed that the 
subject's eyes are open. 

A photograph or moving image of a person incorporating a full body shot will be 
interpreted by the viewer as having a different meaning to a shot of exactly the same 
person, where the image consists of only a closeup of the face of the subject. A 
full-length body shot is typically interpreted by a viewer as informative and is useful to 
determine the sociological factors of the subject and the relationship of the subject to the 
particular environment. 

An example of this is illustrated in Figs. 2A to 2C which show the same subject 
matter presented with three different shot types. Fig. 2A is a wide shot of the subject 
within the landscape and is informative as to the location, subject and activity taking place
within the scene. Fig. 2B is a mid-shot of the subject with some of the surrounding 
landscape, and changes the emphasis from the location and activity to the character of the 
subject. Fig. 2C provides a closeup of the subject and draws the audience to focus upon 
the subject. 

Panning is a technique used by screen writers to help the audience participate in 
the absorption of information within a scene. The technique is commonly used with open
landscapes or when establishing shots are used in movie productions. A straight shot,
obtained when the camera does not move, contrasts with the effectiveness of a pan. With a
straight shot, the viewer is forced to move their eyes around the scene, searching for 
information, as opposed to how the pan feeds information to the viewer thus not requiring 
the viewer to seek out a particular message. The movement of the camera within a pan 
directs the audience as to those elements within a scene that should be observed and, 
when used correctly, is intended to mimic the human method of information interpretation 
and absorption. Fig. 3A is an example of a still shot including a number of image 
elements (eg. the sun, the house, the cow, the person and the kite) which the audience 
may scan for information. In film, a still shot is typically used as an establishing shot so 
as to orientate the audience with the location and the relationship to the story. The screen 
writer relies upon this type of shot to make sense of any following scenes. Fig. 3B 
demonstrates an example of a panning technique combined with a zoom, spread amongst 
four consecutive frames. 

Further, differing camera angles, as opposed to direct, straight shots, are often 
used to generate meaning from the subject, such meaning not otherwise being available 
due to dialogue alone. For example, newspaper and television journalists often use 
altered camera angles to solicit propaganda about preferred election candidates. For 
example, interviews recorded from a low angle present the subject as superior to the 
audience, whereas the presentation of the same subject may be altered if taken from a 
high angle to give an inferior interpretation. The same technique is commonly used in 
movie making to dramatically increase the effect of an antagonist and their victim. When 
the victim is shot from a high angle, they not only appear as weak and vulnerable, but the 
audience empathises with the character and also experiences their fear.

Fig. 4A is indicative of an eye level shot which is a standard shot contrasting 
with angles used in other shots and seen in Figs. 4B to 4E. Fig. 4B shows a high angle 
shot and is used to place the subject in an inferior position. Fig. 4C is indicative of a low 
angle shot where the camera angle is held low with the subject projecting them as 
superior. Fig. 4D is indicative of an oblique angle shot where the camera is held off- 
centre influencing the audience to interpret the subject as out of the ordinary, or as 
unbalanced in character. Fig. 4E is representative of a Dutch angle shot which is often 
used to generate a hurried, "no time to waste" or bizarre effect of the subject. The 
audience is conveyed a message that something has gone astray in either a positive or 
negative fashion.

There are many other types of images or shots in addition to those discussed
above that can give insight to the particular story being presented. Tracking shots follow 
the subject allowing the audience the experience of being part of the action. Panning 
gives meaning and designates importance to subjects within a scene as well as providing a
panoramic view of the scene. A "swish" pan is similar, but is used more as a
transition within a scene, quickly sweeping from one subject to another, thus generating a 
blurred effect. Tilt shots consist of moving the camera from one point up or down, thus 
mimicking the way in which humans evaluate a person or vertical object absorbing the 
information presented thereby. A hand-held shot portrays to the audience that the filming 
is taking place immediately, and is often used to best effect when associated with shots
taken when the camera is supported (eg. using a tripod or boom). 

To understand the impact visual language has on presenting images in a more 
meaningful way, it is appropriate to compare the results of contemporary motion pictures 
with earlier attempts of film making. Early examples of motion pictures consisted of full 
shots of the characters from the feet upwards reflecting the transition from stage acting. 
For example, the Charlie Chaplin era of film making and story telling contrasts sharply 
with later dramatic, emotion filled motion pictures. Pioneering director D.W. Griffith
notably first introduced the use of a palette of shot types for the purpose of creating drama
in film. This arose from a desire of the audience to explore the emotional experience of 
the characters of the film. 

Film makers also use other techniques to tell their story, such techniques 
including the choice of lens and film effects. These are all used to encourage the 
audience to understand the intended message or purpose of the production. The audience 
does not need to understand how, or even be aware that, these techniques have been 
applied to the images. In fact, if applied properly with skill, the methods will not even be 
apparent to the audience. 

The skill required by the successful film maker is typically only acquired 
through many years of tuition and practice as well as through the collaboration of many 
experts to achieve a successfully crafted message. Amateur film makers and home video 
makers in contrast often lack the skill and the opportunity to understand or employ such 
methods. However, amateur and home film makers, being well exposed to professional 
film productions have a desire for their own productions to be refined to some extent 
approaching that of professional productions, if not those of big-budget Hollywood 
extravaganzas. Whilst there currently exist many film schools that specialise in courses
to educate potential film makers with such techniques, attendance at such courses is often
prohibitive to the amateur film maker. Other techniques currently available that may
assist the amateur film maker typically include software products to aid in the
sequencing of images and/or interactive education techniques for tutoring prospective
film makers. However, current software approaches have not been widely adopted due to
prohibitive costs and the skill required for use being excessive for small (domestic)
productions.

Time is also a major factor in respect to the current techniques of film editing for the
unskilled editor. Typically, the time taken to plan shots and their sequencing is
substantial and is typically out of the realistic scope of an average home/amateur film
maker.

It is therefore desirable to provide a means by which unskilled (amateur) movie 
makers can create visual productions that convey a desired emotive effect to an audience 
without a need for extensive planning or examination of shot types. 

Summary of the Invention 
This need is addressed through the automated classification of images and/or 
shots into various emotive categories thereby permitting editing to achieve a desired 
emotive effect. 

According to a first aspect of the present disclosure, there is provided a method 
for automated classification of a digital image, said method comprising the steps of: 
analysing said image for the presence of a human face;

determining a size of the located face with respect to a size of said image; and

classifying said image based on the relative size of said face with respect to said
image.

According to a second aspect of the present disclosure, there is provided a 
method for automated classification of a digital image, said method comprising the steps 
of: 

analysing said image for the presence of a human face; 

determining a position of the located face with respect to a frame of said image;
and

classifying said image based on the relative position of said face with respect to
said image frame.

According to another aspect of the present disclosure, there is provided apparatus 
for implementing any one of the aforementioned methods.

According to another aspect of the present disclosure there is provided a
computer program product including a computer readable medium having recorded 
thereon a computer program for implementing any one of the methods described above. 

Brief Description of the Drawings 

One or more embodiments of the present invention will now be described with 
reference to the drawings, in which: 

Figs. 1A to 1G depict a number of shot ranges used by film makers; 
Figs. 2A to 2C depict three different shot types used by film makers;

Figs. 3A and 3B depict the effect of a pan in influencing the emotional state of 
the viewer; 

Figs. 4A to 4E depict various angled camera shots also used by film makers; 
Fig. 5 is a schematic block diagram representation of an image recording and
production system; 

Fig. 6 is a schematic block diagram of a general purpose computer system upon 
which the disclosed arrangements can be practiced; and 

Fig. 7 is a flow chart depicting the use of templates for video editing. 
Detailed Description including Best Mode 

Fig. 5 shows a schematic representation of an image recording and production 
system 500 where a scene 502 is captured using an image recording device 504, such as a 
digital video camera or digital still camera. When the scene 502 is captured by a still 
camera, typically a sequence of still images is recorded, in effect complementing the 
sequence of images that might be recorded by a video camera. Associated with the 
capture of the images is the generation of capture data 506 which is output from the 
camera 504 and typically comprises image data 506a, video data 506b, audio data 506c
and "camera" metadata 506d. The camera metadata 506d represents metadata usually
generated automatically by the camera or manually entered by the user of the camera.
Such metadata can include an image or frame number, a real time of capture possibly
including a date, details regarding camera settings (aperture, exposure etc.) and ambient
information such as light measurements, to name but a few.

Where appropriate, the capture data 506 recorded by the camera 504 is
transferred 508 to a mass storage arrangement 510, typically associated with a computing
system, whereupon the images are made available via an interconnection 520 to a visual
language classification system 522. The classification system 522 generates metadata
which is configured for convenient editing by the film maker. The visual language 
classification system 522 outputs classification data 524, configured as further metadata,
which is associated with each image and which may be stored within a mass storage
unit 526. The classification data 524 in the store 526 may be output to an editing
module 514 which, through accessing the image data via a connection 512 to the
store 510, provides for the formation of an edited sequence 528 which may be output to a
presentation unit 516 for display via a display unit 518, such as a television display, or
storage in a mass storage device 519. In some implementations, the stores 510, 526
and 519 may be integrally formed.

The classification system 522 performs content analysis to analyse the images 
residing in the store 510. The analysis performed within the classification system 522 is 
configured to provide information on the intention of the photographer at the time of
capturing the image or image sequence. Such analysis may comprise the detection of
human faces and preferably other visually distinct features including landscape features
such as the sky, green grass, sandy or brown earth, or other particular shapes such as
motor vehicles, buildings and the like, from the image data. Audio analysis where
appropriate can be used to identify specific events within the sequence of images such as a
person talking, the passing of a motor car, or the crack of a ball hitting a bat in a sports
game, such as baseball or cricket, for example. The classification system 522 provides
metadata related to or indicative of the content identified within an image sequence, or at 
the particular image within the sequence. 

One specific example of content analysis that may be applied by classification 
system 522 is that of face detection, that permits identification and tracking of particular 
human subjects in images or sequences thereof. An example of a face detection 
arrangement that may be used in the arrangement of Fig. 5 is that described in US Patent 
No. 5,642,431-A (Poggio et al.). Another example is that disclosed in Australian Patent
Publication No. AU-A-33982/99. Such face detection arrangements typically identify 
within an image frame a group or area of pixels which are skin coloured and thus may 
represent a face, thereby enabling that group or area, and thus the face, to be tagged by 
metadata and monitored. Such monitoring may include establishing a bounding box 
about the height and width of the detected face and thereafter tracking changes or 
movement in the box across a number of image frames.
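
The bounding-box monitoring described above may be sketched as follows. This is a minimal illustration only: the `FaceBox` type and the `box_changes` function are hypothetical names that do not appear in the cited face detection arrangements, and the boxes are assumed to be supplied by some separate detector with a non-zero area.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaceBox:
    """Axis-aligned bounding box about a detected face, in pixels (hypothetical type)."""
    left: float
    top: float
    width: float
    height: float

    @property
    def centre(self):
        # Centre point of the box, used to measure movement between frames.
        return (self.left + self.width / 2, self.top + self.height / 2)

    @property
    def area(self):
        return self.width * self.height


def box_changes(boxes):
    """Return (dx, dy, area_ratio) for each pair of consecutive frames.

    dx and dy give the movement of the box centre; area_ratio gives the
    change in box size (values above 1 mean the face has grown in the frame).
    Assumes every box has a non-zero area.
    """
    changes = []
    for prev, curr in zip(boxes, boxes[1:]):
        (px, py), (cx, cy) = prev.centre, curr.centre
        changes.append((cx - px, cy - py, curr.area / prev.area))
    return changes
```

A growing area ratio over successive frames may suggest, for example, that the camera is zooming in on the subject or that the subject is approaching the camera.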

In the sequence of images of Figs. 1A to 1G, the fine content of Figs. 1A and 1B
is generally too small to permit accurate face detection. As such, those frames may be
classified as non-face images. However in each of Figs. 1C to 1G, the face of the person
flying the kite is quite discernible and a significant feature of each respective image.
Thus, those images may be automatically classified as face images, such classification
being identified as metadata 524 generated by content analysis performed by the
classification system 522 and linked or otherwise associated with the metadata 506d
provided with the images.

Further, and in a preferred implementation, the size of the detected face, as a
proportion of the overall image size, is used to establish and record the type of shot. For
example, simple rules may be established to identify the type of shot. A first rule can be 
that, where a face is detected, but the face is substantially smaller than the image in which 
the face is detected, that image may be classified as a far shot. A similar rule is where a 
face is detected which is sized substantially the same as the image. This may be classified 
as a close-up. An extreme close-up may be where the face occupies the entire image or 
where it is substantially the same size as the image but extends beyond the edges of the 
image. 

In another example, in Fig. 1C, which is an MLS, the face represents about 2% of
the image. In Fig. 1D, the face occupies about 4% of the image, this being an MS. For
Fig. 1E, an MCU delivers the face at a size of about 10% of the image. The CU shot of
Fig. 1F provides the face at about 60% of the image, and for an ECU, the face is in excess
of about 80% of the image. A suitable set of rules may thus be established to define the
type of shot relative to the subject, whether or not the subject is a face or some other
identifiable image structure (eg. cow, house, motor vehicle, etc). Example rules are set
out below:

Medium Long Shot (MLS)    subject < 2.5% of the image;

Medium Shot (MS)          2.5% < subject < 10% of the image;

Medium Close Up (MCU)     10% < subject < 30% of the image;

Close Up (CU)             30% < subject < 80% of the image; and

Extreme Close Up (ECU)    subject > 80% of the image.

Where desired, the film maker may vary the rules depending on the particular 
type of source footage available, or depending on a particular editing effect desired to be 
achieved. 
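
By way of illustration only, the example rules above may be expressed as a small lookup. The function name `classify_shot` and the `SHOT_RULES` table are hypothetical. The rules as stated leave the exact boundary values (2.5%, 10%, 30%, 80%) unassigned, so this sketch assumes each boundary belongs to the tighter shot, which is consistent with the face of about 10% in Fig. 1E being an MCU.

```python
# Upper bounds expressed as fractions of the image area, paired with shot labels.
# The values follow the example rules; boundary handling is an assumption.
SHOT_RULES = [
    (0.025, "MLS"),  # subject < 2.5% of the image
    (0.10, "MS"),    # 2.5% <= subject < 10%
    (0.30, "MCU"),   # 10% <= subject < 30%
    (0.80, "CU"),    # 30% <= subject < 80%
]


def classify_shot(subject_area, image_area, rules=SHOT_RULES):
    """Return a shot label from the subject's size relative to the image.

    `rules` may be replaced by the film maker to suit the source footage
    or the editing effect desired.
    """
    if image_area <= 0:
        raise ValueError("image_area must be positive")
    fraction = subject_area / image_area
    for upper_bound, label in rules:
        if fraction < upper_bound:
            return label
    return "ECU"  # subject >= 80% of the image
```

Passing a different `rules` list corresponds to the film maker varying the rules; for example, `classify_shot(4, 100, rules=[(0.05, "wide")])` would apply a single user-defined threshold, with anything larger falling through to the ECU label.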

Another example of content analysis for classification is camera tilt angle. This 
can be assessed by examining the relative position of a detected face in the image frame. 
For example, as seen in Fig. 4A, where the face is detected centrally within the image 
frame, this may be classified as an eye-level shot. In Fig. 4B, where the subject is
positioned towards the bottom of the frame, such may be classified as a high angle shot.
The positioning of the detected face may be correlated with a tiling of the image frame so
as to provide the desired classification. Tiles within the frame may be pre-classified as
eye-level, high shot, low shot, left side, and right side. The location of the detected face
in certain tiles may then be used to determine an average tile location and thus classify the
image according to the position of the average face tile. Such an approach may be readily
applied to the images of Figs. 4A to 4D.
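
The tiling approach may be sketched as follows. The 3x3 grid, the assignment of labels to tiles and the name `classify_by_tiles` are all assumptions made for illustration; the description fixes only that a centrally located face indicates an eye-level shot and that a face towards the bottom of the frame indicates a high angle shot.

```python
# Hypothetical 3x3 tiling. Rows run top to bottom, columns left to right.
# Only the centre and bottom labels follow directly from the description;
# the remaining assignments are assumed by symmetry.
TILE_LABELS = [
    ["low shot", "low shot", "low shot"],
    ["left side", "eye-level", "right side"],
    ["high shot", "high shot", "high shot"],
]


def classify_by_tiles(face_box, image_size, labels=TILE_LABELS):
    """Classify an image from the average tile covered by the face.

    face_box is (left, top, width, height) and image_size is (width, height),
    both in pixels. Returns None when the face lies outside the frame.
    """
    left, top, width, height = face_box
    image_width, image_height = image_size
    rows, cols = len(labels), len(labels[0])
    tile_w, tile_h = image_width / cols, image_height / rows

    # Collect every tile that the face bounding box overlaps.
    covered = [
        (r, c)
        for r in range(rows)
        for c in range(cols)
        if left < (c + 1) * tile_w and left + width > c * tile_w
        and top < (r + 1) * tile_h and top + height > r * tile_h
    ]
    if not covered:
        return None

    # Average tile location, rounded to the nearest tile index.
    avg_row = sum(r for r, _ in covered) / len(covered)
    avg_col = sum(c for _, c in covered) / len(covered)
    return labels[round(avg_row)][round(avg_col)]
```

A finer grid, or a different label layout, may be supplied through the `labels` argument without changing the averaging step.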

The Dutch shot of Fig. 4E may be determined by detecting edges within the 
image. Such edges may be detected using any one of a large number of known edge 
detection arrangements. Edges in images often indicate the horizon, or some other 
horizontal edge, or vertical edges such as those formed by building walls. An edge that is 
detected as being substantially non-vertical and non-horizontal may thus indicate a Dutch 
shot. Classification may be performed by comparing an angle of inclination of the 
detected edge with the image frame. Where the angle is about 0 degrees or about 90 
degrees, such may be indicative of an horizon or vertical wall respectively. Such may be 
a traditional shot. However, where the angle of inclination is substantially between these 
values, a Dutch shot may be indicated. Preferred angles of inclination for such detection 
may be between 30 and 60 degrees, but may be determined by the user where desired. 
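
The comparison of the angle of inclination may be sketched as follows, assuming that an edge has already been located by some edge detection arrangement and is supplied as two end points. The function names are hypothetical, and treating every angle outside the 30 to 60 degree range as a traditional shot is an assumption of this sketch.

```python
import math


def edge_inclination(x0, y0, x1, y1):
    """Angle between an edge and the horizontal frame axis, folded into 0..90 degrees."""
    angle = abs(math.degrees(math.atan2(y1 - y0, x1 - x0))) % 180
    # An edge at 135 degrees is inclined as much as one at 45 degrees.
    return 180 - angle if angle > 90 else angle


def classify_edge(angle, dutch_range=(30.0, 60.0)):
    """Label an inclination as a Dutch shot or a traditional shot.

    dutch_range holds the preferred limits, which the user may override.
    """
    low, high = dutch_range
    return "dutch" if low <= angle <= high else "traditional"
```

For instance, an edge running from (0, 0) to (10, 10) has an inclination of about 45 degrees and would be labelled a Dutch shot, whereas a horizon running from (0, 0) to (10, 0) would not.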

In an alternative implementation, the visual language classification system can 
permit the user to supplement the classification with other terms relating to the emotive 
message conveyed by the scene. Such manually entered metadata may include terms 
such as "happy", "smiling", "leisure", and "fun" in the example of Figs. 1C to 1G. More 
complicated descriptions may also be entered, such as "kite flying". This manually entered
metadata can supplement the automatically generated metadata and be stored with the
automatically generated metadata.

As a result of such processing, the store 526 is formed to include metadata 
representative of the content of source images to be used to form the final production. 
The metadata not only includes timing and sequencing (eg. scene number etc.)
information, but also information indicative of the content of the images and shot types
which can be used as prompts in the editing process to follow. 

With the database 526 formed, the user may then commence editing the selected 
images. This is done by invoking an editing system 514 which extracts the appropriate 
images or sequence of images from the store 510. Using the information contained within
the metadata store 526, the user may conveniently edit particular images. The database
information may be used to define fade-in and fade-out points, images where a change in
zoom is desired, points of interest within individual images which can represent focal
centres for zooming operations either or both as source or target, amongst many others.

Editing performed by the editing system 514 may operate using the
classifications 524 in a variety of ways. For example, the user may wish to commence an 
image sequence with a long shot, and hence may enter into the system 514 a request for 
all long shots to be listed. The system 514 then interrogates the store 526 to form a pick- 
list of images that have been previously classified as a long shot. The user may then 
select a long shot from the list to commence the edited sequence. The classification thus 
substantially reduces the user's editing time by providing a ready source of searchable 
information regarding each image or shot sequence. Another example is where the user 
wishes to show the emotion "fear" in the faces of the subjects. Since faces are typically 
not detected in any significant detail for anything under a medium shot, a search of the 
store 526 may be made for all medium shots, close-ups and extreme close-ups. A 
corresponding pick list results from which the user can conveniently review a generally 
smaller number of images than the total number available to determine those that show 
"fear". User entered metadata such as "fear" may then supplement the automatically 
generated classification for those images that display such an emotion. 
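
The pick-list interrogation described above may be sketched as follows. The record layout (a dictionary with `image_id` and `shot_type` keys) and the function names are assumptions made for illustration and do not represent the actual structure of the store 526.

```python
def pick_list(store, shot_types):
    """Return the records whose classification matches any requested shot type."""
    wanted = set(shot_types)
    return [record for record in store if record["shot_type"] in wanted]


def add_user_term(record, term):
    """Supplement a record's automatically generated classification with a user term."""
    record.setdefault("user_terms", []).append(term)


# Hypothetical contents of the metadata store.
store = [
    {"image_id": 1, "shot_type": "LS"},
    {"image_id": 2, "shot_type": "MS"},
    {"image_id": 3, "shot_type": "CU"},
    {"image_id": 4, "shot_type": "ECU"},
]

# Candidate images for showing an emotion: medium shots and closer.
candidates = pick_list(store, ["MS", "CU", "ECU"])

# After review, the user tags an image that shows the emotion.
add_user_term(candidates[1], "fear")
```

Because the records are shared between the store and the pick list, the user-entered term remains attached to the image for later searches.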

The automated content analysis of images as discussed above permits the rapid 
processing of sequences of images to facilitate the formation of an enhanced edited result. 
For example, where a video source is provided having 25 frames per second, a 5 second 
shot requires the editing of 125 frames. To perform manual face detection and focal point 
establishment on each frame is time consuming and prone to inconsistent results due to 
human inconsistency. Through automation by content analysis, the position of the face
in each frame may be located according to consistently applied rules. All that is then
necessary is for the user to select the start and end points and the corresponding edit
functions (eg. zoom values from 0% at the start, and 60% at the end).
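
The per-frame arithmetic of this example may be sketched as follows. Linear interpolation between the start and end zoom values is an assumption of the sketch, as the description does not state how intermediate values are obtained, and the name `zoom_schedule` is hypothetical.

```python
def zoom_schedule(frames_per_second, seconds, start_zoom, end_zoom):
    """Return one zoom value per frame, interpolated linearly from start to end."""
    frame_count = round(frames_per_second * seconds)
    if frame_count < 2:
        return [start_zoom] * frame_count
    step = (end_zoom - start_zoom) / (frame_count - 1)
    return [start_zoom + index * step for index in range(frame_count)]


# A 5 second shot at 25 frames per second, zooming from 0% to 60%.
schedule = zoom_schedule(25, 5, 0.0, 60.0)
```

The resulting list holds 125 values, one for each frame of the shot, so that the user need only nominate the two end values.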

Metadata analysis of the source material may include the following: 

(i) time code and date data; 

(ii) GPS data; 

(iii) image quality analysis (sharpness, colour, content quality, etc.); 

(iv) original shot type detection; 

(v) object detection and custom object detection (determined by the 
author); 

(vi) movement detection; 

(vii) face detection; 

(viii) audio detection; 

(ix) collision detection;

(x) tile (interframe structure) analysis; and

(xi) user entered metadata. 

The method described above with reference to Fig. 5 is preferably practiced 
using a conventional general-purpose computer system 600, such as that shown in Fig. 6 
wherein the processes of Fig. 5 may be implemented as software, such as an application 
program executing within the computer system 600. The software may be divided into
two separate parts: one part for carrying out the classification and editing methods, and
another part to manage the user interface between the latter and the user. The software
may be stored in a computer readable medium, including the storage devices described
below, for example. The software is loaded into the computer from the computer
readable medium, and then executed by the computer. A computer readable medium
having such software or computer program recorded on it is a computer program product.
The use of the computer program product in the computer preferably effects an
advantageous apparatus for classification and consequential editing of images or
sequences of images.

The computer system 600 comprises a computer module 601, input devices such 
as a keyboard 602 and mouse 603, output devices including a printer 615 and a visual 
display device 614 and loud speaker 617. A Modulator-Demodulator (Modem) 
transceiver device 616 is used by the computer module 601 for communicating to and 
from a communications network 620, for example connectable via a telephone line 621 or
other functional medium. The modem 616 can be used to obtain access to the Internet, 
and other network systems, such as a Local Area Network (LAN) or a Wide Area 
Network (WAN). 

The computer module 601 typically includes at least one processor unit 605, a 
memory unit 606, for example formed from semiconductor random access memory 
(RAM) and read only memory (ROM), input/output (I/O) interfaces including an
audio/video interface 607, and an I/O interface 613 for the keyboard 602 and mouse 603
and optionally a joystick (not illustrated), and an interface 608 for the modem 616. A 
storage device 609 is provided and typically includes a hard disk drive 610 and a floppy 
disk drive 611. A magnetic tape drive (not illustrated) may also be used. A CD-ROM
drive 612 is typically provided as a non-volatile source of data. The components 605 to
613 of the computer module 601 typically communicate via an interconnected bus 604
and in a manner which results in a conventional mode of operation of the computer 
system 600 known to those in the relevant art. Examples of computers on which the
described arrangements can be practised include IBM-PC's and compatibles, Sun
Sparcstations or like computer systems evolved therefrom.

Typically, the application program is resident on the hard disk drive 610 and 
read and controlled in its execution by the processor 605. Intermediate storage of the 
program and any data fetched from the network 620 may be accomplished using the 
semiconductor memory 606, possibly in concert with the hard disk drive 610. In some 
instances, the application program may be supplied to the user encoded on a CD-ROM or 
floppy disk and read via the corresponding drive 612 or 611, or alternatively may be read
by the user from the network 620 via the modem device 616. Still further, the software
can also be loaded into the computer system 600 from other computer readable media
including magnetic tape, a ROM or integrated circuit, a magneto-optical disk, a radio or 
infra-red transmission channel between the computer module 601 and another device, a 
computer readable card such as a PCMCIA card, and the Internet and Intranets including 
e-mail transmissions and information recorded on Websites and the like. The foregoing is 
merely exemplary of relevant computer readable media. Other computer readable media 
may also be used. 

The method described with reference to Fig. 6 may alternatively or additionally 
be implemented in dedicated hardware such as one or more integrated circuits performing 
the functions or sub functions of the system. Such dedicated hardware may include 
graphic processors, digital signal processors, or one or more microprocessors and 
associated memories. For example, specific visual effects such as zoom and image 
interpolation may be performed in specific hardware devices configured for such 
functions. Other processing modules, for example, used for face detection or audio 
processing, may be performed in dedicated DSP apparatus. 

The description above with respect to Fig. 5 indicates how the editing 
system 514 may be used to create an output presentation based upon classifications 
derived from the image content. A further approach to editing may be achieved using a 
template-based approach 700 depicted in the flow chart of Fig. 7, which for example may 
be implemented within the editing system 514. The method 700 commences at step 702 
where a desired clip, being a portion of footage between a single start-stop transition, is 
selected for processing. A number of clips may be processed in sequence to create a 
production. This is followed by step 704 where a desired template is selected for 
application to the clip. A template in this regard is a set of editing rules that may be 
applied to various shot and clip types to achieve a desired visual effect. Alternatively, a 
template need only be applied to a portion of a clip, or in some instances one or more still 
images or video extracts for which processing is desired. Typically a number of 
templates 706 are available for selection 708. Each template 706 may be established as a 
Boolean set of rules each with a number of default settings. An example template is 
depicted in Table 1 below, which defines particular visual effects that are to be 
applied to particular shot types. 

Table 1 

Template #2 

          Effect
                  Speed of replay
Shot type Select  x1/4  x1/2  x1    x2    x4    B&W   Zoom time  Color filter  Sound  etc.
------------------------------------------------------------------------------------------
ECU       1       1                             1     0          1             0
CU        1       1                             1     0          1             0
MCU       1                   1                 1     +2         1             0
MS        0
MLS       0
LS        0
Other#1   1                               1     1     0          1             1
Other#2   0

In the template of Table 1, the various shot types are listed based upon face 
detection criteria described above. Two "other" shot types are shown, these for example 
being where no face is detected or some other detectable event may be determined. Such 
for example may be frames containing a white coloured motor racing car of particular 
interest to the user, as compared to other coloured racing cars that may have been 
captured. Such a racing car may be detected by the classification system 522 being 
arranged to detect both a substantial region of the colour white and also substantial 
movement of that colour, thereby permitting such frames to be classified as "Other#1". 
The movement may be actual movement of the racing car across the frame over a series 
of adjacent frames, or relative movement where the racing car appears substantially 
stationary within the series of adjacent frames, whilst substantial movement of the 
background occurs. Such a classification may be formed independent of the ECU, CU, 
MCU etc. approach described above. As seen from Table 1, each of the ECU, CU, MCU and 
Other#1 shot types is selected for inclusion in the edited presentation. 

The template (ie. template #2) selected 710 may be altered according to a user 
determination made in step 712. Where alteration is desired, step 714 follows which 
permits the user to modify the Boolean values within the template table. As seen above, 
those shot types not selected (ie. MS, MLS, LS and Other#2) are disabled in the table, 
as indicated by the shading thereof. Those selected shot types may then have their 
corresponding effects modified by the user. As shown, a number of different speeds of 
replay are provided, the selection of one for any shot type disabling the others for the same 
shot type. As seen, each of the ECU and CU is selected to replay at quarter speed, 
whereas the MCU replays at natural speed. The racing car captured by the Other#1 shot 
type is selected for replay at four times speed to fulfil the user's desire to accentuate the 
differences between facial and motor car shots. Each of the selected shots has a 
monochrome (B&W) setting selected, thereby removing colour variation, although a 
colour filter effect has been enabled. Such an effect may provide a constant 
orange/brown tinge to the entire frame and in this example would result in the images 
being reproduced with an aged-sepia effect. Sound is seen disabled on the facial shots but 
enabled on the racing car shots. 
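
By way of illustration only, the settings of Table 1 may be held in software as one record per shot type. The following Python sketch is one possible representation and is not taken from the described arrangements; the names ShotRule, REPLAY_SPEEDS, TEMPLATE_2 and selected_shot_types are hypothetical. Holding the replay speed in a single field is a design choice that reflects the rule that selecting one speed for a shot type disables the others.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Replay speeds offered by the columns of Table 1 (x1/4, x1/2, x1, x2, x4).
REPLAY_SPEEDS = (0.25, 0.5, 1.0, 2.0, 4.0)


@dataclass(frozen=True)
class ShotRule:
    """Hypothetical record holding one row of a template table."""
    select: bool                    # is this shot type included in the output?
    speed: Optional[float] = None   # one field, so only one speed can be chosen
    bw: bool = False                # monochrome (B&W) setting
    zoom: str = "0"                 # zoom code, e.g. "+2"; "0" means no zoom
    colour_filter: bool = False
    sound: bool = False

    def __post_init__(self) -> None:
        # A selected shot type must use one of the speeds the table offers.
        if self.select and self.speed not in REPLAY_SPEEDS:
            raise ValueError(f"unsupported replay speed: {self.speed!r}")


# Template #2 as read from Table 1; unselected rows carry no effect settings.
TEMPLATE_2: Dict[str, ShotRule] = {
    "ECU": ShotRule(True, 0.25, bw=True, colour_filter=True),
    "CU": ShotRule(True, 0.25, bw=True, colour_filter=True),
    "MCU": ShotRule(True, 1.0, bw=True, zoom="+2", colour_filter=True),
    "MS": ShotRule(False),
    "MLS": ShotRule(False),
    "LS": ShotRule(False),
    "Other#1": ShotRule(True, 4.0, bw=True, colour_filter=True, sound=True),
    "Other#2": ShotRule(False),
}


def selected_shot_types(template: Dict[str, ShotRule]) -> List[str]:
    """Return the shot types enabled for inclusion, in table order."""
    return [name for name, rule in template.items() if rule.select]
```

A user modification of the kind made in step 714 would then amount to replacing one record, for example substituting a new ShotRule for the "MCU" entry.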

A zoom feature is also provided to permit transitions between adjacent shot 
types. As seen in the example of Table 1, MCU shots are subject to a zoom of "+2", this 
notation representing a zoom-in to the next shot type (ie. CU) with the zoom occurring 
over a period of 2 seconds. Typically, during the zoom, the image is automatically 
cropped to retain a size within that of the display. Zoom-outs are also possible and are 
indicated by a minus symbol (-). Durations may be specified in seconds, frames, or as 
being instantaneous (eg. ++), the latter directly creating a new frame for inclusion in the 
edited production. The transitions for zoom in Table 1 are specified as occurring between 
adjacent shot types. Alternatively the degree of zoom and the zoom duration may be 
separately specified for each shot type (eg: MCU: 150%: 25 frames; CU: 200%: 10 
frames; ECU: 30%: 50 frames). In this fashion, the edited production may show for a 
particular shot type a zoom to another shot type over a predetermined period thereby 
enhancing the emotional effect of the production. For example, a zoom from an MCU to 
an ECU may form part of a "dramatic" template, being one where ECUs are used to focus 
the viewer's attention on the central character. A "tribute" template may include a zoom 
from an MCU to a CU. 
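
The zoom notation may be interpreted in software along the following lines. This Python sketch is illustrative only and rests on stated assumptions: the ordering SHOT_ORDER from farthest to closest shot, the use of "--" for an instantaneous zoom-out, and the restriction to durations in seconds are all assumptions, since the description gives only "+2", the minus symbol and "++" as examples. The names Zoom and parse_zoom are hypothetical.

```python
from typing import NamedTuple, Optional

# Assumed ordering of the face-based shot types, farthest to closest.
SHOT_ORDER = ("LS", "MLS", "MS", "MCU", "CU", "ECU")


class Zoom(NamedTuple):
    target: str      # shot type reached at the end of the zoom
    seconds: float   # zoom duration; 0.0 means instantaneous


def parse_zoom(shot_type: str, code: str) -> Optional[Zoom]:
    """Interpret a zoom code such as "+2", "-3" or "++" for a shot type.

    Returns None when no zoom is requested ("0" or an empty code).
    """
    if code in ("", "0"):
        return None
    sign, rest = code[0], code[1:]
    if sign not in ("+", "-"):
        raise ValueError(f"zoom code must start with + or -: {code!r}")
    # A doubled sign ("++", or "--" by assumption) denotes an instant zoom.
    seconds = 0.0 if rest == sign else float(rest)
    step = 1 if sign == "+" else -1
    index = SHOT_ORDER.index(shot_type) + step
    if not 0 <= index < len(SHOT_ORDER):
        raise ValueError(f"no adjacent shot type for {shot_type} with {code!r}")
    return Zoom(SHOT_ORDER[index], seconds)
```

Under these assumptions, parse_zoom("MCU", "+2") yields a zoom to "CU" over 2 seconds, matching the MCU row of Table 1.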

Other types of image editing effects may be applied within a template as desired. 

Once modified, the template is stored and control returns to step 704 where the 
user may select the template just modified. Once a template has been selected, step 716 
follows where the sequence of clips is derived from the camera metadata retained in the 
store 718. Once the correct sequence is formed, the sequence is edited in step 720 by 
applying the selected template to the sequence. This step involves sourcing firstly the 
classification metadata from the store 718 to determine the shot types and then sourcing 
the video data to which the various effects selected for that shot may be applied. This 
results in the output presentation of step 722 which may be sent for storage or directly 
reproduced to a display arrangement. 
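
Steps 716 and 720 may be sketched in Python as follows. The sketch is illustrative only: the clip records, their field names ("id", "start", "shot_type"), the example data and the function edit_sequence are hypothetical, and the template is reduced to plain dictionaries. Clips are ordered by a start time assumed to come from the camera metadata, and clips whose shot type is not selected are omitted from the output.

```python
from typing import Any, Dict, List, Tuple

Clip = Dict[str, Any]
Rule = Dict[str, Any]


def edit_sequence(
    clips: List[Clip], template: Dict[str, Rule]
) -> List[Tuple[str, Dict[str, Any]]]:
    """Build an edit list of (clip id, effects) pairs for the selected clips."""
    # Step 716: derive the sequence from the (assumed) capture start times.
    ordered = sorted(clips, key=lambda clip: clip["start"])
    edit_list = []
    for clip in ordered:
        # Step 720: look up the rule for the clip's classified shot type.
        rule = template.get(clip["shot_type"])
        if rule is None or not rule.get("select", False):
            continue  # shot type not selected, so the clip is omitted
        effects = {name: value for name, value in rule.items() if name != "select"}
        edit_list.append((clip["id"], effects))
    return edit_list


# Hypothetical example data: a reduced template and three classified clips.
EXAMPLE_TEMPLATE: Dict[str, Rule] = {
    "ECU": {"select": True, "speed": 0.25, "bw": True, "sound": False},
    "MS": {"select": False},
    "Other#1": {"select": True, "speed": 4.0, "bw": True, "sound": True},
}
EXAMPLE_CLIPS: List[Clip] = [
    {"id": "clip-b", "start": 12.0, "shot_type": "MS"},
    {"id": "clip-c", "start": 30.5, "shot_type": "Other#1"},
    {"id": "clip-a", "start": 0.0, "shot_type": "ECU"},
]
```

The resulting edit list names only the clips to be rendered and the effects for each; applying those effects to the video data is left to the rendering stage.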

It will be appreciated that a variety of templates may be created, each having the 
capacity to impose on the source image data a particular emotive editing style in response 
to the classification of shot types contained therein. Further, individual clips or scenes 
may be edited using different templates thereby altering the presentation style based upon 
the subject matter. Accordingly, a family visit to the motor races may include scenes 
depicting a picnic lunch using substantially natural footage but limited to MS's and 
MLS's, action scenes edited in the manner described above with respect to Table 1, and 
super-action scenes where substantial slow motion is used to accentuate a crash during the 
race. The crash may be classified by the user supplementing the metadata of that portion 
of footage with a tag indicating importance. Also, whilst the template of Table 1 relies 
predominantly on shot distance, other classifications such as tilt angle as discussed above 
may alternatively or additionally be included. 

Industrial Applicability 

The arrangements described are applicable to the image editing and reproduction 
industries and find particular application with amateur movie makers who are not trained in 
the intricacies of shot and subject identification, and consequential editing based 
thereupon. 

The foregoing describes only some embodiments of the present invention, and 
modifications and/or changes can be made thereto without departing from the scope and 
spirit of the present invention, the described embodiments being illustrative and not 
restrictive. 

In the context of this specification, the word "comprising" means "including 
principally but not necessarily solely" or "having" or "including" and not "consisting only 
of". Variations of the word comprising, such as "comprise" and "comprises" have 
corresponding meanings. 




The claims defining the invention are as follows: 

1. A method for automated classification of a digital image, said method 
comprising the steps of: 
analysing said image for the presence of a human face; 

determining a size of the located face with respect to a size of said image; and 
classifying said image based on the relative size of said face with respect to said 
image. 

2. A method according to claim 1 wherein said image is classified using a term 
which provides information about an intention of a photographer who captured said 
image. 

3. A method according to claim 1 or 2 wherein said image is classified as a far-shot 
if the size of said located face is substantially less than the size of said image. 

4. A method according to claim 1 or 2 wherein said image is classified as a 
close-up where the size of said located face substantially corresponds with the size of said 
image. 

5. A method according to claim 1 or 2 wherein said image is classified as an 
extreme close-up where only a part of said located face appears within said image. 

6. A method according to claim 1 or 2 wherein said classifying comprises 
associating a size of said located face with a set of predetermined thresholds for a size of 
a human face image. 

7. A method according to claim 1 or 2 wherein said image is classified as a far shot 
if said image contains a face and the size of said located face is below a first 
predetermined threshold compared to the size of said image. 

8. A method according to claim 7 wherein said image is classified as an extreme 
close up if the size of said located face is above a second predetermined threshold 
compared to the size of said image. 

9. A method according to claim 8 wherein said image is classified as a close-up if 
the size of said located face is below said second predetermined threshold and above a 
third predetermined threshold compared to the size of said image. 

10. A method according to claim 9 wherein said image is classified as a medium shot 
if the size of said located face is greater than said first predetermined threshold and less 
than said third predetermined threshold. 

11. A method according to claim 1 wherein said analysing comprises interpreting 
information provided with said image. 

12. A method according to claim 11 wherein said image comprises a frame of a 
digital video sequence of images. 

13. A method according to claim 12 wherein said information is associated with 
other frames of said sequence. 

14. A method according to claim 1 wherein said analysing comprises detecting one 
or more regions of said image at which skin coloured pixels are located in order to locate 
said face. 

15. A method according to claim 1 wherein said determining approximates the size 
of said located face by a height and width of a bounding rectangle that encloses said face. 

16. A method for automated classification of a digital image, said method 
comprising the steps of: 

analysing said image for the presence of a human face; 

determining a position of the located face with respect to a frame of said image; 
and 

classifying said image based on the relative position of said face with respect to 
said image frame. 



17. A method according to claim 16 wherein said image is classified using a term 
which provides information about an intention of a photographer who captured said 
image. 

18. A method according to claim 16 or 17 wherein said image is classified as a 
high-shot if the position of said located face is substantially toward a bottom of said image 
frame. 

19. A method according to claim 16 or 17 wherein said image is classified as an 
eye-level shot where the position of said located face substantially corresponds with a centre 
of said image frame. 

20. A method according to claim 16 or 17 wherein said image is classified as a low 
shot where the position of said located face is substantially toward a top of said image 
frame. 

21. A method according to claim 16 or 17 wherein said image is classified as a left 
shot where the position of said located face is substantially toward a right hand side of 
said image frame. 

22. A method according to claim 16 or 17 wherein said image is classified as a right 
shot where the position of said located face is substantially toward a left hand side of said 
image frame. 

23. A method according to claim 16 or 17 wherein said image is classified as a low 
shot where the position of said located face is substantially toward a top of said image 
frame. 

24. A method according to claim 16 wherein said analysing comprises interpreting 
information provided with said image. 

25. A method according to claim 16 wherein said image comprises a frame of a 
digital video sequence of images. 

26. A method according to claim 25 wherein said information is associated with 
other frames of said sequence. 



27. A method according to claim 1 further comprising the steps of: 
detecting an edge within said image; 
determining an angle of inclination between said edge and an axis of said image 
frame; 

classifying said image as a Dutch shot where said angle of inclination is between 
predetermined angles of inclination. 

28. A method according to claim 27 wherein said predetermined angles of 
inclination comprise 30 and 60 degrees. 

29. A method according to claim 16 further comprising: 

analysing said image for the presence of a predetermined non-human component; 
assessing said predetermined component with respect to at least one further 
criteria; and 

where said criteria is met, classifying said image based upon the presence of said 
predetermined component. 

30. A method according to claim 29 wherein said predetermined component 
comprises a colour of a distinct region of said image. 

31. A method according to claim 29 wherein said criteria comprises at least a 
relative motion of said predetermined component within said image. 

32. A method of processing an input sequence of images, said method comprising 
the steps of: 

classifying each said image of said sequence using a method according to 
claim 1; and 

editing said sequence using said classification to form an output sequence of 
images. 

33. A method according to claim 31 wherein said editing comprises applying an edit 
function to each said image of said input sequence, those ones of said images not 
satisfying said edit function being omitted from said output sequence. 

34. A method according to claim 34 wherein said editing comprises establishing an 
editing template for said sequence, each said edit function forming a component of said 
template and corresponding to one of said image classifications. 

35. A method according to claim 33 wherein said edit function comprises at least 
one effect for application to the image, said effect being selected from the group 
consisting of visual effects and audible effects. 

36. A method according to claim 35 wherein said visual effects are selected from the 
group consisting of reproduction speed variation, zooming, blurring, and colour variation. 

37. Apparatus for automated classification of a digital image, said apparatus comprising: 

means for analysing said image for the presence of a human face; 

means for determining a size of the located face with respect to a size of said 
image; and 

means for classifying said image based on the relative size of said face with 
respect to said image. 

38. Apparatus according to claim 37 wherein: 

(i) said image is classified as a far-shot if the size of said located face is 
substantially less than the size of said image; 

(ii) said image is classified as a close-up where the size of said located face 
substantially corresponds with the size of said image; and 

(iii) said image is classified as an extreme close-up where only a part of said 
located face appears within said image. 

39. Apparatus according to claim 37 wherein said means for classifying associates a 
size of said located face with a set of predetermined thresholds for a size of a human face 
image. 



40. Apparatus according to claim 39 wherein: 

(i) said image is classified as a far shot if said image contains a face and 
the size of said located face is below a first predetermined threshold compared to the size 
of said image; 

(ii) said image is classified as an extreme close up if the size of said located 
face is above a second predetermined threshold compared to the size of said image; 

(iii) said image is classified as a close-up if the size of said located face is 
below said second predetermined threshold and above a third predetermined threshold 
compared to the size of said image; and 

(iv) said image is classified as a medium shot if the size of said located face 
is greater than said first predetermined threshold and less than said third predetermined 
threshold. 

41. Apparatus according to claim 37 wherein said analysing comprises interpreting 
information provided with said image. 

42. Apparatus according to claim 41 wherein said image comprises a frame of a 
digital video sequence of images. 

43. Apparatus according to claim 41 wherein said means for analysing detects one or 
more regions of said image at which skin coloured pixels are located in order to locate 
said face. 

44. Apparatus according to claim 43 wherein said means for determining 
approximates the size of said located face by a height and width of a bounding rectangle 
that encloses said face. 

45. Apparatus for automated classification of a digital image, said apparatus 
comprising: 

means for analysing said image for the presence of a human face; 

means for determining a position of the located face with respect to a frame of 
said image; and 

means for classifying said image based on the relative position of said face with 
respect to said image frame. 

46. Apparatus according to claim 45 wherein: 

(i) said image is classified as a high-shot if the position of said located face 
is substantially toward a bottom of said image frame; 

(ii) said image is classified as an eye-level shot where the position of said 
located face substantially corresponds with a centre of said image frame; 

(iii) said image is classified as a low shot where the position of said located 
face is substantially toward a top of said image frame; 

(iv) said image is classified as a left shot where the position of said located 
face is substantially toward a right hand side of said image frame; 

(v) said image is classified as a right shot where the position of said located 
face is substantially toward a left hand side of said image frame; 

(vi) said image is classified as a low shot where the position of said located 
face is substantially toward a top of said image frame. 

47. Apparatus according to claim 46 wherein said analysing comprises interpreting 
information provided with said image. 

48. Apparatus according to claim 46 wherein said image comprises a frame of a 
digital video sequence of images. 

49. Apparatus according to claim 48 wherein said information is associated with 
other frames of said sequence. 

50. Apparatus according to claim 37 further comprising: 
means for detecting an edge within said image; 

means for determining an angle of inclination between said edge and an axis of 
said image frame; 

means for classifying said image as a Dutch shot where said angle of inclination 
is between predetermined angles of inclination. 

51. Apparatus according to claim 37 further comprising: 

means for analysing said image for the presence of a predetermined non-human 
component; 

means for assessing said predetermined component with respect to at least one 
further criteria; and 

where said criteria is met, classifying said image based upon the presence of said 
predetermined component. 

52. Apparatus according to claim 51 wherein said predetermined component 
comprises a colour of a distinct region of said image. 

53. Apparatus according to claim 51 wherein said criteria comprises at least a 
relative motion of said predetermined component within said image. 

54. Apparatus for processing an image sequence, said apparatus comprising: 
classification apparatus according to claim 37 for determining a classification for 
each image of said sequence; and 

means for editing said sequence using said classification to form an output 
sequence of images. 



55. Apparatus according to claim 54 wherein said means for editing comprises 
applying an edit function to each said image of said input sequence, those ones of said 
images not satisfying said edit function being omitted from said output sequence. 

56. Apparatus according to claim 55 wherein said editing comprises establishing an 
editing template for said sequence, each said edit function forming a component of said 
template and corresponding to one of said image classifications. 



57. Apparatus according to claim 56 wherein said edit function comprises at least 
one effect for application to the image, said effect being selected from the group 
consisting of visual effects and audible effects. 

58. Apparatus according to claim 57 wherein said visual effects are selected from the 
group consisting of reproduction speed variation, zooming, blurring, and colour variation. 

59. A computer readable medium incorporating a computer program product 
operable upon computer apparatus for automated classification of a digital image, said 
computer program product comprising: 

code for analysing said image for the presence of a human face; 

code for determining a size of the located face with respect to a size of said 
image; and 

code for classifying said image based on the relative size of said face with 
respect to said image. 

60. A computer readable medium according to claim 59 wherein: 

(i) said image is classified as a far-shot if the size of said located face is 
substantially less than the size of said image; 

(ii) said image is classified as a close-up where the size of said located face 
substantially corresponds with the size of said image; and 

(iii) said image is classified as an extreme close-up where only a part of said 
located face appears within said image. 

61. A computer readable medium according to claim 60 wherein said classifying 
comprises associating a size of said located face with a set of predetermined thresholds 
for a size of a human face image. 

62. A computer readable medium according to claim 61 wherein: 

(i) said image is classified as a far shot if said image contains a face and 
the size of said located face is below a first predetermined threshold compared to the size 
of said image; 

(ii) said image is classified as an extreme close up if the size of said located 
face is above a second predetermined threshold compared to the size of said image; 

(iii) said image is classified as a close-up if the size of said located face is 
below said second predetermined threshold and above a third predetermined threshold 
compared to the size of said image; and 

(iv) said image is classified as a medium shot if the size of said located face 
is greater than said first predetermined threshold and less than said third predetermined 
threshold. 



63. A computer readable medium according to claim 59 wherein said analysing 
comprises interpreting information provided with said image. 

64. A computer readable medium according to claim 63 wherein said image 
comprises a frame of a digital video sequence of images. 

65. A computer readable medium according to claim 64 wherein said information is 
associated with other frames of said sequence. 

66. A computer readable medium according to claim 59 wherein said analysing 
comprises detecting one or more regions of said image at which skin coloured pixels are 
located in order to locate said face. 

67. A computer readable medium according to claim 59 wherein said determining 
approximates the size of said located face by a height and width of a bounding rectangle 
that encloses said face. 

68. A computer readable medium according to claim 59 further comprising: 
code for analysing said image for the presence of a human face; 

code for determining a position of the located face with respect to a frame of said 
image; and 

code for classifying said image based on the relative position of said face with 
respect to said image frame. 

69. A computer readable medium according to claim 68 wherein: 

(i) said image is classified as a high-shot if the position of said located face 
is substantially toward a bottom of said image frame; 

(ii) said image is classified as an eye-level shot where the position of said 
located face substantially corresponds with a centre of said image frame; 

(iii) said image is classified as a low shot where the position of said located 
face is substantially toward a top of said image frame; 

(iv) said image is classified as a left shot where the position of said located 
face is substantially toward a right hand side of said image frame; 

(v) said image is classified as a right shot where the position of said located 
face is substantially toward a left hand side of said image frame; 

(vi) said image is classified as a low shot where the position of said located 
face is substantially toward a top of said image frame. 

70. A computer readable medium according to claim 69 wherein said analysing 
comprises interpreting information provided with said image. 

71. A computer readable medium according to claim 69 wherein said image 
comprises a frame of a digital video sequence of images. 

72. A computer readable medium according to claim 71 wherein said information is 
associated with other frames of said sequence. 

73. A computer readable medium according to claim 72 further comprising: 
code for detecting an edge within said image; 

code for determining an angle of inclination between said edge and an axis of 
said image frame; 

code for classifying said image as a Dutch shot where said angle of inclination is 
between predetermined angles of inclination. 

74. A computer readable medium according to claim 73 wherein said predetermined 
angles of inclination comprise 30 and 60 degrees. 

75. A computer readable medium according to claim 74 further comprising: 

code for analysing said image for the presence of a predetermined non-human 
component; 

code for assessing said predetermined component with respect to at least one 
further criteria; and 

where said criteria is met, classifying said image based upon the presence of said 
predetermined component. 

76. A computer readable medium according to claim 75 wherein said predetermined 
component comprises a colour of a distinct region of said image. 

77. A computer readable medium according to claim 76 wherein said criteria 
comprises at least a relative motion of said predetermined component within said image. 

78. A computer readable medium incorporating a computer program product for 
processing an input sequence of images, comprising: 

code for classifying each said image of said sequence using the computer 
program product of claim 77; and 

code for editing said sequence using said classification to form an output 
sequence of images. 

79. A computer readable medium according to claim 78 wherein said editing 
comprises applying an edit function to each said image of said input sequence, those ones 
of said images not satisfying said edit function being omitted from said output sequence. 

80. A computer readable medium according to claim 79 wherein said editing 
comprises establishing an editing template for said sequence, each said edit function 
forming a component of said template and corresponding to one of said image 
classifications. 

81. A computer readable medium according to claim 80 wherein said edit function 
comprises at least one effect for application to the image, said effect being selected from 
the group consisting of visual effects and audible effects. 

82. A computer readable medium according to claim 81 wherein said visual effects 
are selected from the group consisting of reproduction speed variation, zooming, blurring, 
and colour variation. 

83. An edited sequence of images formed through implementation of a series of 
images according to any one of the preceding claims. 

84. A method for automated classification of a digital image substantially as 
described herein with reference to any one of the embodiments of the method as that 
embodiment is illustrated in the drawings. 

85. A method of editing images substantially as described herein with reference to 
any one of the embodiments of the method as that embodiment is illustrated in the 
drawings. 



DATED this FIFTH Day of DECEMBER 2000 
CANON KABUSHIKI KAISHA 
Patent Attorneys for the Applicant 
SPRUSON & FERGUSON 